UPSTREAM PR #15719: ggml : block repack support for Q4_K quantization for AArch64 architecture #612
Conversation
* new quant type: block_q4_kx4 with offline repack impl
* new quantize path: NEON impl for ggml_quantize_mat_q8_K_4x8
* new gemv kernel: ggml_gemv_q4_K_4x8_q8_K based on dotprod
* new gemm kernel: ggml_gemm_q4_K_4x8_q8_K based on i8mm
* performance boost for both S_PP and S_TG

Co-authored-by: yuanjia111 <yuan.jia@sanechips.com.cn>
Performance Analysis Summary: PR #612

Overview
This PR introduces a Q4_K block repacking optimization for the AArch64 architecture using NEON SIMD instructions. The implementation adds a new 4x8 interleaving pattern.

Key Findings: Performance-Critical Function Changes
- Matrix quantization (`ggml_quantize_mat_q8_K_4x8`)
- Matrix-vector multiply (`ggml_gemv_q4_K_4x8_q8_K`)
- Matrix-matrix multiply (`ggml_gemm_q4_K_4x8_q8_K`)

Impact on Inference Performance
The modified functions operate at the quantization and low-level matrix operation layer, not the tokenization or inference orchestration layer.

Power Consumption Analysis
The power increase is concentrated in the CPU backend library where the new repacking logic executes. The 2.10% increase represents the energy cost of the pre-processing (repacking) operations that enable faster runtime execution on NEON-capable hardware.
Mirrored from ggml-org/llama.cpp#15719
This PR improves the q4_k_q8_k kernel with block repacking support for the AArch64 architecture, based on NEON.
The following structures and functions are implemented:
- `block_q4_kx4`: based on four q4_k blocks, along with an offline repacking function
- `block_q8_Kx4`: quantized in `ggml_quantize_mat_q8_K_4x8()`
- `ggml_gemv_q4_K_4x8_q8_K()`: NEON kernel for the `GGML_OP_MUL_MAT_ID`/`GGML_OP_MUL_MAT` ops
- `ggml_gemm_q4_K_4x8_q8_K()`: NEON kernel for the `GGML_OP_MUL_MAT_ID`/`GGML_OP_MUL_MAT` ops

Test environment
Bench results
Good gains were observed with this PR, for both S_PP and S_TG:
(1) meta-llama-3-8b-instruct.Q4_K_M.gguf
[S_PP and S_TG benchmark tables: original vs. this PR, with speedup; numeric results not preserved in this mirror]

(2) DeepSeek-V3-Q4_k_M.gguf
[S_PP and S_TG benchmark tables: original vs. this PR, with speedup; numeric results not preserved in this mirror]
Perplexity
(1) meta-llama-3-8b-instruct.Q4_K_M.gguf
(2) DeepSeek-V3-Q4_k_M.gguf
[perplexity comparison tables not preserved in this mirror]
Reference
PS: the x86 patch shares the same `block_q8_Kx4` structure with this patch, but the detailed layout is different.